Overview of Topics

  • Introduction & Motivating Example
  • Multi-State AD-PCA
  • Simulation Study
  • Case Study Results
  • Software Package for R
  • Summary and References

Introduction & Motivating Example

Introduction

  • Many factories or other closed production systems use real-time online process monitoring.
  • We aim to detect potential problems within the system before they cause damage or force the process to shut down.
  • Data description:
    • non-Gaussian process
    • multiple correlated variables
    • temporal dependence of observations
    • non-stationary due to operator input and ambient conditions

Process Dimensionality

  • In multivariate contexts, the number of parameters to estimate increases quadratically with dimension.
  • That is, we must estimate \(p + \frac{p}{2}(p + 1) \in \mathcal{O}(p^2)\) first- and second-moment parameters if we have \(p\) features.
  • Linear dimension reduction (LDR) allows us to simplify complex data structures into simpler ones via linear combinations of features.
  • LDR is built around eigen-decomposition / principal component analysis (PCA) or the singular value decomposition (SVD).
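As a concrete check of the quadratic growth in parameters, here is a minimal Python sketch (the function name `n_moment_params` is ours):

```python
def n_moment_params(p):
    """p means plus p(p + 1)/2 distinct covariance entries."""
    return p + p * (p + 1) // 2

# 3 features need 9 parameters; the facility's 80 features need 3,320.
print(n_moment_params(3), n_moment_params(80))
```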

Facility and Process Details

  • This work is motivated by our partnership with the Colorado School of Mines on the ReNUWit water preservation grant.
  • Our team manages a decentralised wastewater treatment (WWT) plant in Golden, CO.
  • We measure 40 features and their lagged values (80 total features).
  • The continuous features are aggregated to the minute-level.
  • We aim to quickly and accurately detect process malfunctions before human operators by incorporating known system state behavior.
  • Wise et al. (1988) applied stock PCA to industrial process monitoring, and Kresta, MacGregor, and Marlin (1991) applied it to chemical process monitoring.
  • Baggiani and Marsili-Libelli (2009) used adaptive PCA, Sanchez-Fernandez, Fuente, and Sainz-Palmero (2015) used distributive PCA, and Kazor et al. (2016) used adaptive-dynamic PCA to monitor WWT processes.

The Bioreactor System

Multi-State Process

Example Feature by Blower

Multi-State AD-PCA

PCA Basics

  • \(\textbf{X} := [\textbf{X}_1\ \vdots\ \textbf{X}_2\ \vdots\ \cdots\ \vdots\ \textbf{X}_n]^T \in \mathbb{R}_{n\times p}\), where \(\textbf{X}_s := \left(X_1(s), X_2(s), \ldots, X_p(s)\right)\) is a realization from process \(\mathcal{P}\) on index \(s = 1, \ldots, n\).
  • \(\textbf{X}^T\textbf{X} := \textbf{P} \times \boldsymbol\Lambda \times \textbf{P}^T\) is the eigendecomposition of the scatter matrix.
  • \(\textbf{P}\) is the \(p \times p\) orthogonal matrix whose columns are the eigenvectors.
  • \(\boldsymbol\Lambda\) is the diagonal matrix of the \(p\) eigenvalues.
  • \(\textbf{P}_d\) is the \(p\times d\) projection matrix preserving the proportion of total data variation (energy) given by \[ \left[\sum\limits_{i = 1}^d \lambda_i\right] / \text{tr}\left(\boldsymbol\Lambda\right). \]
  • \(\textbf{Y}_s := \textbf{X}_s\textbf{P}_d \in\mathbb{R}_{1\times d}\) is the reduced-dimension representation of observation \(\textbf{X}_s\).
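The steps above can be sketched in Python with NumPy (using a hypothetical toy data matrix, not the facility data):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((500, 5))       # n = 500 realizations of a p = 5 process
X = X - X.mean(axis=0)                  # center so X^T X is the scatter matrix

# Eigendecomposition X^T X = P Lambda P^T
lam, P = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]           # eigh returns eigenvalues in ascending order
lam, P = lam[order], P[:, order]

d = 2
P_d = P[:, :d]                          # p x d projection matrix
energy = lam[:d].sum() / lam.sum()      # proportion of total variation preserved
Y = X @ P_d                             # n x d reduced-dimension scores
```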

Competing Methods

Stock PCA is sub-optimal here because the data are autocorrelated, non-linear, and non-stationary. Some alternatives are:

  • Adaptive-Dynamic PCA (AD-PCA) of Kazor et al. (2016)
  • Kernel PCA (kPCA) of Ge, Yang, and Song (2009)
  • Adaptive kPCA (AkPCA) of Chouaib, Mohamed-Faouzi, and Messaoud (2013)
  • Locally Linear Embedding (LLE) of Miao et al. (2013)
  • Multi-dimensional scaling and isometric mapping (IsoMap) of Tenenbaum, de Silva, and Langford (2000)
  • Semidefinite Embedding (SDE) / Maximum Variance Unfolding (MVU) of Weinberger and Saul (2006)

We combine Adaptive, Dynamic, and Multi-State modifications to PCA.

Adaptive PCA

  • Assume \(\mathcal{P}\) is locally linear within a window of \(w + n_u\) observations.
  • Create a \(w\)-width rolling training window preceding index \(s\) by defining \[ \textbf{X}_{w} := [\textbf{X}_{s - w + 1}\ \vdots\ \textbf{X}_{s - w + 2}\ \vdots\ \cdots\ \vdots\ \textbf{X}_{s - 1}\ \vdots\ \textbf{X}_s]^T. \]
  • Calculate the scatter matrix with \(\textbf{X}_{w}\) instead of the full data matrix \(\textbf{X}\).
  • Estimate \(\textbf{P}_d\).
  • After computing the monitoring statistics for the new observations, "learn" the newest \(n_u\) observations, "forget" the oldest \(n_u\) observations, and estimate the new scatter matrix.
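The learn/forget cycle might be sketched as follows (the function name and window sizes are ours, for illustration):

```python
import numpy as np

def rolling_projections(X, w=100, n_u=20, d=2):
    """Refit P_d on a w-width training window, then advance by n_u:
    'learn' the n_u newest observations and 'forget' the n_u oldest."""
    out = []
    for start in range(0, X.shape[0] - w + 1, n_u):
        X_w = X[start:start + w]                   # current training window
        X_w = X_w - X_w.mean(axis=0)
        lam, P = np.linalg.eigh(X_w.T @ X_w)
        out.append(P[:, np.argsort(lam)[::-1][:d]])  # top-d eigenvectors
    return out

rng = np.random.default_rng(1)
mats = rolling_projections(rng.standard_normal((300, 4)))
```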

Dynamic PCA

  • Because the features are correlated over time, we include up to \(\ell\) lags per feature.
  • The observation vector at index \(s\) is now \[ \begin{aligned} \text{L}(\textbf{X}_s) := &[X_1(s), X_1\left( s - 1 \right), \ldots, X_1\left( s - \ell \right),\ X_2(s), X_2\left( s - 1 \right), \ldots, X_2\left( s - \ell \right), \\ &\qquad \cdots,\ X_p(s), X_p\left( s - 1 \right), \ldots, X_p\left( s - \ell \right)]. \end{aligned} \]
  • These \(n - \ell\) rows form the lagged data matrix \[ \text{L}(\textbf{X}) := [\text{L}(\textbf{X}_{\ell + 1})\ \vdots\ \text{L}(\textbf{X}_{\ell + 2})\ \vdots\ \cdots\ \vdots\ \text{L}(\textbf{X}_n)]^T \in\mathbb{R}_{(n - \ell) \times p(\ell + 1)}. \]
  • Calculate the scatter matrix with \(\text{L}(\textbf{X})\) instead of the data matrix \(\textbf{X}\).
  • Estimate the \(p(\ell + 1) \times d\) projection matrix \(\textbf{P}_d\).
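The lagged-matrix construction can be sketched as follows (0-based NumPy indexing, toy values):

```python
import numpy as np

def lag_matrix(X, ell):
    """Build L(X): each row stacks X_j(s), X_j(s-1), ..., X_j(s-ell),
    feature by feature, for every index s with ell full lags available."""
    n, p = X.shape
    rows = [np.concatenate([X[s - np.arange(ell + 1), j] for j in range(p)])
            for s in range(ell, n)]
    return np.asarray(rows)

X = np.arange(12.0).reshape(6, 2)       # toy n = 6, p = 2 data matrix
LX = lag_matrix(X, ell=2)               # (n - ell) x p(ell + 1) = 4 x 6
```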

Multi-State PCA

  • Let \(\mathcal{S}_k,\ k = 1, \ldots, K\) be a set of disjoint process states.
  • For the index \(s\), the observation \(\textbf{X}_{s}\) is said to belong in state \(\mathcal{S}_k\) if and only if \[ \textbf{X}_{s} \sim f_k\left( \boldsymbol\mu_k, \boldsymbol\Sigma_k \right). \]
  • \(f_k\) is some distribution with location parameter \(\boldsymbol\mu_k \in \mathbb{R}_{p\times 1}\) and positive semi-definite scale matrix \(\boldsymbol\Sigma_k \in \mathbb{R}^{\ge}_{p\times p}\).
  • Because the distribution changes from state to state, the principal components are also a function of the state.
  • Partition \(\textbf{X}\) as \(\left\{\textbf{X}^{(1)}, \textbf{X}^{(2)}, \ldots, \textbf{X}^{(K)}\right\}\), where \(\textbf{X}^{(k)}\) is the ordered set of observations under state \(\mathcal{S}_k\).
  • Estimate \(K\) distinct projection matrices \(\left\{\textbf{P}_d^{(1)}, \textbf{P}_d^{(2)}, \ldots, \textbf{P}_d^{(K)}\right\}\) from the partitioned \(\textbf{X}\).
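Partitioning the data by state and fitting one projection per state might look like this (the function name and toy state labels are ours):

```python
import numpy as np

def multistate_projections(X, states, d):
    """One d-dimensional projection matrix per disjoint process state."""
    proj = {}
    for k in np.unique(states):
        Xk = X[states == k]                        # observations under state k
        Xk = Xk - Xk.mean(axis=0)
        lam, P = np.linalg.eigh(Xk.T @ Xk)
        proj[k] = P[:, np.argsort(lam)[::-1][:d]]  # top-d eigenvectors
    return proj

rng = np.random.default_rng(7)
X = rng.standard_normal((90, 4))
states = np.repeat([1, 2, 3], 30)                  # K = 3 states, 30 obs each
proj = multistate_projections(X, states, d=2)
```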

Monitoring Statistics

These statistics are estimated non-parametrically, as discussed by Kazor et al. (2016).

Hotelling's \(T^2\)

  • The \(T^2\) statistic is the Mahalanobis distance of \(\textbf{Y}_{s + 1}\), the new observation mapped from the original space into the PCA subspace.
  • The \(T^2\) statistic is \(T^2_{s + 1} = \textbf{Y}_{s + 1}\boldsymbol\Lambda_d^{-1} \textbf{Y}_{s + 1}^T\), for \(\boldsymbol\Lambda_d := \text{diag}\left( \lambda_1, \lambda_2, \ldots, \lambda_d \right)\).

Squared Prediction Error

  • The SPE statistic measures the goodness-of-fit of the \(d\)-dimensional model to the \(p\)-dimensional process observations, or how well \(\textbf{P}_d\) approximates \(\textbf{P}\).
  • The SPE statistic is \(\text{SPE}(\textbf{Y}_{s + 1}) = \left( \textbf{X}_{s + 1} - \textbf{Y}_{s + 1}\textbf{P}^T_d \right) \left( \textbf{X}_{s + 1} - \textbf{Y}_{s + 1}\textbf{P}^T_d \right)^T.\)
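Both statistics for a single new observation can be computed as follows (toy orthonormal \(\textbf{P}_d\); the function name is ours):

```python
import numpy as np

def t2_and_spe(x_new, P_d, lam_d):
    """Hotelling's T^2 and SPE for one centered observation x_new."""
    y = x_new @ P_d                     # score in the d-dim PCA subspace
    t2 = y @ (y / lam_d)                # Mahalanobis distance: y Lambda_d^{-1} y^T
    resid = x_new - y @ P_d.T           # reconstruction error in the full space
    return t2, resid @ resid

# Toy example: the first two coordinate axes as the PCA subspace
P_d = np.eye(3)[:, :2]
t2, spe = t2_and_spe(np.array([1.0, 2.0, 3.0]), P_d, np.array([2.0, 1.0]))
```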

Simulation of NOC Observations

Details of Data Generation

We follow Kazor et al. (2016) in updating the original design of Dong and McAvoy (1996). Thus:

  • The errors of \(t\), \(\epsilon_s\), are drawn from a first-order autoregressive process over \(s\), with innovations \(\varepsilon_s \sim N(0, \sigma_{\varepsilon}^2)\),
  • The process dimension \(p = 3\),
  • The cycle length is \(\omega = 7 \times 24 \times 60 = 10{,}080\), corresponding to one observation each minute over one week,
  • The observation index \(s = 1, \ldots, \omega\),
  • The observation at index \(s\) is given by \(\textbf{X}_s := \left\langle x_1 = X_1(t_s), x_2 = X_2(t_s), \ldots, x_p = X_p(t_s) \right\rangle\), and
  • The autocorrelation parameter \(\varphi = 0.75\).

Errors and Latent Feature Construction

  1. Draw the first error from \[ \epsilon_1 \sim \mathcal{N}\left(\frac{1}{2}(a + b)(1 - \varphi), \frac{(b - a)^2}{12}(1 - \varphi ^ 2)\right), \] where \(a = 0.01\) and \(b = 2\). These mean and variance multipliers come from the mean and variance of the uniform\((a,b)\) distribution.
  2. Define the AR(1) error structure as \(\epsilon_s := \varphi \epsilon_{s - 1} + (1 - \varphi) \varepsilon_s,\quad s = 2, 3, \ldots\).
  3. Construct the latent feature \(t\) as \[ t_s := -\cos\left(\frac{2\pi}{\omega}s\right) + \epsilon_s,\quad s = 1, \ldots, \omega. \]
  4. Scale \(t_s\) into \([a, b]\).
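The four steps can be sketched as follows; the innovation standard deviation \(\sigma_\varepsilon\) is not specified above, so the value used here is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, phi, omega = 0.01, 2.0, 0.75, 7 * 24 * 60
sd_innov = 0.1                          # sigma_epsilon: an assumed value

# Step 1: draw the first error using the uniform(a, b) mean and variance
eps = np.empty(omega)
eps[0] = rng.normal(0.5 * (a + b) * (1 - phi),
                    np.sqrt((b - a) ** 2 / 12 * (1 - phi ** 2)))

# Step 2: AR(1) recursion with fresh Gaussian innovations
innov = rng.normal(0.0, sd_innov, size=omega)
for s in range(1, omega):
    eps[s] = phi * eps[s - 1] + (1 - phi) * innov[s]

# Step 3: latent feature with one cycle per week
s_idx = np.arange(1, omega + 1)
t = -np.cos(2 * np.pi * s_idx / omega) + eps

# Step 4: rescale t into [a, b]
t = a + (b - a) * (t - t.min()) / (t.max() - t.min())
```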

Feature and State Definitions

  1. Sample three machine-error vectors of length \(\omega\), \(\textbf{e}_1, \textbf{e}_2, \textbf{e}_3\overset{i.i.d.}{\sim}\mathcal{N}(0, 0.01)\).
  2. Define the three feature vectors to be \[ \begin{align*} x(t_s) &:= t_s + \textbf{e}_1(s) \\ y(t_s) &:= t_s^2 - 3t_s + \textbf{e}_2(s) \\ z(t_s) &:= -t_s^3 + 3t_s^2 + \textbf{e}_3(s). \end{align*} \]
  3. Change states hourly in sequence, where the three process states are

    • \(\mathcal{S}_1\): \(\textbf{X}(t_s) := \langle x(t_s), y(t_s), z(t_s) \rangle \cdot \textbf{I}\)
    • \(\mathcal{S}_2\): \(\textbf{X}(t_s) := \langle x(t_s), y(t_s), z(t_s) \rangle \cdot \textbf{P}_2 \boldsymbol\Lambda_2\)
    • \(\mathcal{S}_3\): \(\textbf{X}(t_s) := \langle x(t_s), y(t_s), z(t_s) \rangle \cdot \textbf{P}_3 \boldsymbol\Lambda_3\)

Process Projections

The process-specific scaling and projection matrices are: \[ \textbf{P}_2 = \begin{bmatrix} 0 & 0.50 & -0.87 \\ 0 & 0.87 & 0.50 \\ 1 & 0 & 0 \end{bmatrix}, \qquad \boldsymbol\Lambda_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 2 \end{bmatrix} \] \[ \textbf{P}_3 = \begin{bmatrix} 0 & 0.87 & -0.50 \\ -1 & 0 & 0 \\ 0 & 0.50 & 0.87 \end{bmatrix}, \qquad \boldsymbol\Lambda_3 = \begin{bmatrix} 0.25 & 0 & 0 \\ 0 & 0.1 & 0 \\ 0 & 0 & 0.75 \end{bmatrix} \] \(\textbf{P}_2\) and \(\textbf{P}_3\) rotate the features so that the states are at right angles in at least one dimension. \(\boldsymbol\Lambda_2\) and \(\boldsymbol\Lambda_3\) inflate or deflate the feature variances along each principal component.
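Applying the state-specific matrices to a feature triple can be sketched as follows (the function name and toy vector are ours; the matrices are as given above):

```python
import numpy as np

P2 = np.array([[0, 0.50, -0.87], [0, 0.87, 0.50], [1, 0, 0]])
L2 = np.diag([1.0, 0.5, 2.0])
P3 = np.array([[0, 0.87, -0.50], [-1, 0, 0], [0, 0.50, 0.87]])
L3 = np.diag([0.25, 0.1, 0.75])

def project_state(xyz, state):
    """Rotate and scale the feature triple <x, y, z> for its state."""
    if state == 1:
        return xyz                      # S1 uses the identity
    return xyz @ (P2 @ L2 if state == 2 else P3 @ L3)

xyz = np.array([1.0, 2.0, 3.0])
```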

Example Projections Figure

ADD FIGURE

Monte Carlo Replication

We set the threshold to \(\alpha = 0.001\) and repeat the following 1000 times:

  1. Draw a random set of observations:
    • one set of NOC under the single-state model,
    • one set of NOC under the multi-state model.
  2. Apply each of the six single-state and nine multi-state faults (discussed in the next section) to the NOC data sets, yielding 17 total data sets (the 2 NOC sets plus the 15 faulted versions).
  3. Train the fault detection system with 75% of the observations:
    • 7,560 observations for the single-state model,
    • 2,520 observations per state for the multi-state model.
  4. Calculate AD- and MSAD-PCA projections for all 17 data sets.

Performance Metrics

An observation is flagged as suspicious if its monitoring statistic exceeds the calculated non-parametric threshold. An observation triggers an alarm if it is the \(k^{\text{th}}\) observation in a row to be flagged. For this simulation study, we triggered an alarm at the third consecutive flag.
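The flag-then-alarm rule can be sketched as follows (the function name is ours):

```python
def alarm_indices(flags, k=3):
    """Indices at which the k-th (or later) consecutive flag raises an alarm."""
    run, alarms = 0, []
    for i, flagged in enumerate(flags):
        run = run + 1 if flagged else 0    # consecutive-flag counter
        if run >= k:
            alarms.append(i)
    return alarms

# Two flags alone do not alarm; three in a row do
flags = [0, 1, 1, 0, 1, 1, 1, 1]
```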

For each statistic under AD- and MSAD-PCA, measure

  • The false alarm rate from the NOC data sets. This is the number of false alarms recorded in the process observations without faults.
  • The fault detection indicator for each of the 15 fault data sets.
  • The time to detection; the earliest a statistic can trigger an alarm is after 3 minutes.

Fault Induction

Fault Overview

Induce one of the following faults at \(s = 8,500\).

Normal Process

A single draw from the multivariate process under NOC. The vertical black line (21:40 on 2 December) marks the time at which the fault would be induced.

Fault 1B

Add 2 in all states to feature \(X\) only. This shift in feature \(X\) will infect features \(Y\) and \(Z\) through \(\textbf{P}_2\boldsymbol\Lambda_2\) and \(\textbf{P}_3\boldsymbol\Lambda_3\).

Fault 2C

Drift feature \(Y\) by a maximum \(-1.5\) only in state \(\mathcal{S}_2\). This fault is induced after state projections.

Fault 3A

Drift the latent variable \(t\) by a maximum of \(+6\) for all features in all states.

Simulation Study Results

False Alarm Rates

Detection Times

Comparison

Under the Multi-State model:

  • The AD-PCA SPE monitoring statistic is not sensitive enough to detect faults reliably.
  • The AD-PCA \(T^2\) monitoring statistic detects faults reliably, but has far too many false alarms.
  • The MSAD-PCA SPE monitoring statistic is sensitive enough to detect faults reliably.
  • The MSAD-PCA \(T^2\) monitoring statistic offers an excellent combination of specificity and high detection probability.

Under the Single-State model, the pairs of AD-PCA and MSAD-PCA process monitoring statistics perform similarly, so we do not expect significant loss of power when using the MSAD-PCA procedure to detect faults.

Case Study Results

False Alarm Rates

We increased the number of consecutive flags required to trigger an alarm from 3 to 5 for the real case due to strong serial autocorrelation. Note that these false alarm rates are all greater than or equal to the set threshold of \(\alpha = 0.1\%\).

                   MSAD \(T^2\)   MSAD SPE   AD \(T^2\)   AD SPE
False Alarm Rate       0.1%         0.3%       0.3%        0.3%

Fault Detection Times

Software Package for R

Package mvMonitoring

  • No website: our package is currently in private beta-testing at the facility.
  • mspProcessData() generates random draws from a serially autocorrelated and nonstationary multi-state (or single-state) multivariate process.
  • mspTrain() trains the projection matrix and non-parametric fault detection thresholds.
  • mspMonitor() assesses incoming process observations and classifies them as normal or abnormal.
  • mspWarning() keeps a running tally of abnormal observations and raises an alarm if necessary.
  • Alarms can be sent via email or SMS from the remote facility to the operators.

Summary and References

Summary

Overview

  • Explained the motivation for a multi-state modification to PCA.
  • Presented the MSAD-PCA dimension reduction procedure.
  • Described the simulation study used to validate the effectiveness of MSAD-PCA.
  • Discussed the simulation study and real data results.

MSAD-PCA Findings

  • Helps detect faults faster and more consistently in multi-state processes
  • Doesn't suffer loss of power when applied to single-state processes
  • Moves another step toward fault attribution

Future Work

References

Baggiani, F., and S. Marsili-Libelli. 2009. “Real-Time Fault Detection and Isolation in Biological Wastewater Treatment Plants.” Water Sci. Technol. 60 (11): 2949–61. doi:10.2166/wst.2009.723.

Chouaib, C., H. Mohamed-Faouzi, and D. Messaoud. 2013. “Adaptive Kernel Principal Component Analysis for Nonlinear Dynamic Process Monitoring.” In Control Conference (ASCC), 2013 9th Asian, 1–6. doi:10.1109/ASCC.2013.6606291.

Dong, Dong, and Thomas J. McAvoy. 1996. “Batch Tracking via Nonlinear Principal Component Analysis.” AIChE J. 42 (8): 2199–2208. doi:10.1002/aic.690420810.

Ge, Zhiqiang, Chunjie Yang, and Zhihuan Song. 2009. “Improved Kernel PCA-Based Monitoring Approach for Nonlinear Processes.” Chem. Eng. Sci. 64 (9): 2245–55. doi:10.1016/j.ces.2009.01.050.

Kazor, Karen, Ryan W. Holloway, Tzahi Y. Cath, and Amanda S. Hering. 2016. “Comparison of Linear and Nonlinear Dimension Reduction Techniques for Automated Process Monitoring of a Decentralized Wastewater Treatment Facility.” Stoch. Environ. Res. Risk Assess. 30 (5): 1527–44. doi:10.1007/s00477-016-1246-2.

Kresta, James V., John F. MacGregor, and Thomas E. Marlin. 1991. “Multivariate Statistical Monitoring of Process Operating Performance.” Can. J. Chem. Eng. 69 (1): 35–47. doi:10.1002/cjce.5450690105.

Miao, Aimin, Zhihuan Song, Zhiqiang Ge, Le Zhou, and Qiaojun Wen. 2013. “Nonlinear Fault Detection Based on Locally Linear Embedding.” J. Control Theory Appl. 11 (4): 615–22. doi:10.1007/s11768-013-2102-2.

Sanchez-Fernandez, A., M. J. Fuente, and G. I. Sainz-Palmero. 2015. “Fault Detection in Wastewater Treatment Plants Using Distributed PCA Methods.” In 2015 IEEE 20th Conference on Emerging Technologies Factory Automation (ETFA), 1–7. doi:10.1109/ETFA.2015.7301504.

Tenenbaum, Joshua B., Vin de Silva, and John C. Langford. 2000. “A Global Geometric Framework for Nonlinear Dimensionality Reduction.” Science 290 (5500): 2319–23. doi:10.1126/science.290.5500.2319.

Weinberger, Kilian Q., and Lawrence K. Saul. 2006. “Unsupervised Learning of Image Manifolds by Semidefinite Programming.” Int. J. Comput. Vision 70 (1): 77–90. doi:10.1007/s11263-005-4939-z.

Wise, B. M., D. J. Veltkamp, B. Davis, N. L. Ricker, and B. R. Kowalski. 1988. “Principal Components Analysis for Monitoring the West Valley Liquid-Fed Ceramic Melter.” In Management of Radioactive Wastes, and Non-Radioactive Wastes from Nuclear Facilities. Vol. 20. Tucson, AZ.